<div class="tmd-doc">
<h1 class="tmd-header-1">
Strings in Go
</h1>
<p></p>
<div class="tmd-usual">
Like many other programming languages, string is also one important kind of types in Go. This article will list all the facts of strings.
</div>
<p></p>
<h3 class="tmd-header-3">
The Internal Structure of String Types
</h3>
<p></p>
<div class="tmd-usual">
For the standard Go compiler, the internal structure of any string type is declared like:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">type _string struct {
	elements *byte // underlying bytes
	len      int   // number of bytes
}
</code></pre>
<div class="tmd-usual">
From the declaration, we know that a string is actually a <code class="tmd-code-span">byte</code> sequence wrapper. In fact, we can really view a string as an (element-immutable) byte slice.
</div>
<p></p>
<div class="tmd-usual">
Note, in Go, <code class="tmd-code-span">byte</code> is a built-in alias of type <code class="tmd-code-span">uint8</code>.
</div>
<p></p>
<h3 class="tmd-header-3">
Some Simple Facts About Strings
</h3>
<p></p>
<div class="tmd-usual">
We have learned the following facts about strings from previous articles.
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
String values can be used as constants (along with boolean and all kinds of numeric values).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
Go supports <a href="basic-types-and-value-literals.html#string-literals">two styles of string literals</a>, the double-quote style (or interpreted literals) and the back-quote style (or raw string literals).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
The zero values of string types are blank strings, which can be represented with <code class="tmd-code-span">""</code> or <code class="tmd-code-span">``</code> in literal.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
Strings can be concatenated with <code class="tmd-code-span">+</code> and <code class="tmd-code-span">+=</code> operators.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
String types are all comparable (by using the <code class="tmd-code-span">==</code> and <code class="tmd-code-span">!=</code> operators). And like integer and floating-point values, two values of the same string type can also be compared with <code class="tmd-code-span">&gt;</code>, <code class="tmd-code-span">&lt;</code>, <code class="tmd-code-span">&gt;=</code> and <code class="tmd-code-span">&lt;=</code> operators. When comparing two strings, their underlying bytes will be compared, one byte by one byte. If one string is a prefix of the other one and the other one is longer, then the other one will be viewed as the larger one.
</div>
</li>
</ul>
<p></p>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

func main() {
	const World = "world"
	var hello = "hello"

	// Concatenate strings.
	var helloWorld = hello + " " + World
	helloWorld += "!"
	fmt.Println(helloWorld) // hello world!

	// Compare strings.
	fmt.Println(hello == "hello")   // true
	fmt.Println(hello &gt; helloWorld) // false
}
</code></pre>
<p></p>
<div class="tmd-usual">
More facts about string types and values in Go.
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
Like Java, the contents (underlying bytes) of string values are immutable. The lengths of string values also can't be modified separately. An addressable string value can only be overwritten as a whole by assigning another string value to it.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
The built-in <code class="tmd-code-span">string</code> type has no methods (just like most other built-in types in Go), but we can
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
use functions provided in the <a href="https://golang.org/pkg/strings/"><code class="tmd-code-span">strings</code> standard package</a> to do all kinds of string manipulations.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
call the built-in <code class="tmd-code-span">len</code> function to get the length of a string (number of bytes stored in the string).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
use the element access syntax <code class="tmd-code-span">aString[i]</code> introduced in <a href="container.html#element-access">container element accesses</a> to get the <span class="tmd-bold">i</span><span class="tmd-italic">th</span> <code class="tmd-code-span">byte</code> value stored in <code class="tmd-code-span">aString</code>. The expression <code class="tmd-code-span">aString[i]</code> is not addressable. In other words, value <code class="tmd-code-span">aString[i]</code> can't be modified.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
use <a href="container.html#subslice">the subslice syntax</a> <code class="tmd-code-span">aString[start:end]</code> to get a substring of <code class="tmd-code-span">aString</code>. Here, <code class="tmd-code-span">start</code> and <code class="tmd-code-span">end</code> are both indexes of bytes stored in <code class="tmd-code-span">aString</code>.
</div>
</li>
</ul>
<p></p>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
For the standard Go compiler, the destination string variable and source string value in a string assignment will share the same underlying byte sequence in memory. The result of a substring expression <code class="tmd-code-span">aString[start:end]</code> also shares the same underlying byte sequence with the base string <code class="tmd-code-span">aString</code> in memory.
</div>
</li>
</ul>
<p></p>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import (
	"fmt"
	"strings"
)

func main() {
	var helloWorld = "hello world!"

	var hello = helloWorld[:5] // substring
	// 104 is the ASCII code (and Unicode) of char 'h'.
	fmt.Println(hello[0])         // 104
	fmt.Printf("%T \n", hello[0]) // uint8

	// hello[0] is unaddressable and immutable,
	// so the following two lines fail to compile.
	/*
	hello[0] = 'H'         // error
	fmt.Println(&amp;hello[0]) // error
	*/

	// The next statement prints: 5 12 true
	fmt.Println(len(hello), len(helloWorld),
			strings.HasPrefix(helloWorld, hello))
}
</code></pre>
<p></p>
<div class="tmd-usual">
Note, if <code class="tmd-code-span">aString</code> and the indexes in expressions <code class="tmd-code-span">aString[i]</code> and <code class="tmd-code-span">aString[start:end]</code> are all constants, then out-of-range constant indexes will make compilations fail. And please note that the evaluation results of such expressions are always non-constants (<a href="https://github.com/golang/go/issues/28591">this might or might not change since a later Go version</a>). For example, the following program will print <code class="tmd-code-span">4 0</code>.
</div>
<p></p>
<p></p>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

const s = "Go101.org" // len(s) == 9

// len(s) is a constant expression,
// whereas len(s[:]) is not.
var a byte = 1 &lt;&lt; len(s) / 128
var b byte = 1 &lt;&lt; len(s[:]) / 128

func main() {
	fmt.Println(a, b) // 4 0
}
</code></pre>
<p></p>
<div class="tmd-usual">
About why variables <code class="tmd-code-span">a</code> and <code class="tmd-code-span">b</code> are evaluated to different values, please read <a href="operators.html#bitwise-shift-left-operand-type-deduction">the special type deduction rule in bitwise shift operator operation</a> and <a href="summaries.html#compile-time-evaluation">which function calls are evaluated at compile time</a>.
</div>
<p></p>
<p></p>
<h3 class="tmd-header-3">
String Encoding and Unicode Code Points
</h3>
<p></p>
<div class="tmd-usual">
Unicode standard specifies a unique value for each character in all kinds of human languages. But the basic unit in Unicode is not character, it is code point instead. For most code points, each of them corresponds to a character, but for a few characters, each of them consists of several code points.
</div>
<p></p>
<div class="tmd-usual">
Code points are represented as <a href="basic-types-and-value-literals.html#rune">rune values</a> in Go. In Go, <code class="tmd-code-span">rune</code> is a built-in alias of type <code class="tmd-code-span">int32</code>.
</div>
<p></p>
<p></p>
<div class="tmd-usual">
In applications, there are several encoding methods to represent code points, such as UTF-8 encoding and UTF-16 encoding. Nowadays, the most popularly used encoding method is UTF-8 encoding. In Go, all string constants are viewed as UTF-8 encoded. At compile time, illegal UTF-8 encoded string constants will make compilation fail. However, at run time, Go runtime can't prevent some strings from being illegally UTF-8 encoded.
</div>
<p></p>
<div class="tmd-usual">
For UTF-8 encoding, each code point value may be stored as one or more bytes (up to four bytes). For example, each English code point (which corresponds to one English character) is stored as one byte, however each Chinese code point (which corresponds to one Chinese character) is stored as three bytes.
</div>
<p></p>
<h3 id="conversions" class="tmd-header-3">
String Related Conversions
</h3>
<p></p>
<div class="tmd-usual">
In the article <a href="constants-and-variables.html#explicit-conversion">constants and variables</a>, we have learned that integers can be explicitly converted to strings (but not vice versa).
</div>
<p></p>
<p></p>
<div class="tmd-usual">
Here introduces two more string related conversions rules in Go:
</div>
<ol class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
a string value can be explicitly converted to a byte slice, and vice versa. A byte slice is a slice with element type's underlying type as <code class="tmd-code-span">[]byte</code>.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
a string value can be explicitly converted to a rune slice, and vice versa. A rune slice is a slice whose element type's underlying type as <code class="tmd-code-span">[]rune</code>.
</div>
</li>
</ol>
<p></p>
<div class="tmd-usual">
In a conversion from a rune slice to string, each slice element (a rune value) will be UTF-8 encoded as from one to four bytes and stored in the result string. If a slice rune element value is outside the range of valid Unicode code points, then it will be viewed as <code class="tmd-code-span">0xFFFD</code>, the code point for the Unicode replacement character. <code class="tmd-code-span">0xFFFD</code> will be UTF-8 encoded as three bytes (<code class="tmd-code-span">0xef 0xbf 0xbd</code>).
</div>
<p></p>
<div class="tmd-usual">
When a string is converted to a rune slice, the bytes stored in the string will be viewed as successive UTF-8 encoding byte sequence representations of many Unicode code points. Bad UTF-8 encoding representations will be converted to a rune value <code class="tmd-code-span">0xFFFD</code>.
</div>
<p></p>
<div class="tmd-usual">
When a string is converted to a byte slice, the result byte slice is just a deep copy of the underlying byte sequence of the string. When a byte slice is converted to a string, the underlying byte sequence of the result string is also just a deep copy of the byte slice. A memory allocation is needed to store the deep copy in each of such conversions. The reason why a deep copy is essential is slice elements are mutable but the bytes stored in strings are immutable, so a byte slice and a string can't share byte elements.
</div>
<p></p>
<div class="tmd-usual">
Please note, for conversions between strings and byte slices,
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
illegal UTF-8 encoded bytes are allowed and will keep unchanged.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the standard Go compiler makes some optimizations for some special cases of such conversions, so that the deep copies are not made. Such cases will be introduced below.
</div>
</li>
</ul>
<p></p>
<div class="tmd-usual">
Conversions between byte slices and rune slices are not supported directly in Go, We can use the following ways to achieve this goal:
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
use string values as a hop. This way is convenient but not very efficient, for two deep copies are needed in the process.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
use the functions in <a href="https://golang.org/pkg/unicode/utf8/">unicode/utf8</a> standard package.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
use <a href="https://golang.org/pkg/bytes/#Runes">the <code class="tmd-code-span">Runes</code> function in the bytes standard package</a> to convert a <code class="tmd-code-span">[]byte</code> value to a <code class="tmd-code-span">[]rune</code> value. There is not a function in this package to convert a rune slice to byte slice.
</div>
</li>
</ul>
<p></p>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import (
	"bytes"
	"unicode/utf8"
)

func Runes2Bytes(rs []rune) []byte {
	n := 0
	for _, r := range rs {
		n += utf8.RuneLen(r)
	}
	n, bs := 0, make([]byte, n)
	for _, r := range rs {
		n += utf8.EncodeRune(bs[n:], r)
	}
	return bs
}

func main() {
	s := "Color Infection is a fun game."
	bs := []byte(s) // string -&gt; []byte
	s = string(bs)  // []byte -&gt; string
	rs := []rune(s) // string -&gt; []rune
	s = string(rs)  // []rune -&gt; string
	rs = bytes.Runes(bs) // []byte -&gt; []rune
	bs = Runes2Bytes(rs) // []rune -&gt; []byte
}
</code></pre>
<p></p>
<p></p>
<h3 id="conversion-optimizations" class="tmd-header-3">
Compiler Optimizations for Conversions Between Strings and Byte Slices
</h3>
<p></p>
<div class="tmd-usual">
Please read the "String and Byte Slices" chapter in the <a href="https://go101.org/optimizations/101.html">Go Optimizations 101</a> book.
</div>
<p></p>
<p></p>
<p></p>
<h3 class="tmd-header-3">
<code class="tmd-code-span">for-range</code> on Strings
</h3>
<p></p>
<div class="tmd-usual">
The <code class="tmd-code-span">for-range</code> loop control flow applies to strings. But please note, <code class="tmd-code-span">for-range</code> will iterate the Unicode code points (as <code class="tmd-code-span">rune</code> values), instead of bytes, in a string. Bad UTF-8 encoding representations in the string will be interpreted as <code class="tmd-code-span">rune</code> value <code class="tmd-code-span">0xFFFD</code>.
</div>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i, rn := range s {
		fmt.Printf("%2v: 0x%x %v \n", i, rn, string(rn))
	}
	fmt.Println(len(s))
}
</code></pre>
<p></p>
<div class="tmd-usual">
The output of the above program:
</div>
<pre class="tmd-code output">
 0: 0x65 e
 1: 0x301 ́
 3: 0x915 क
 6: 0x94d ्
 9: 0x937 ष
12: 0x93f ि
15: 0x61 a
16: 0x3c0 π
18: 0x56e7 囧
21
</pre>
<p></p>
<div class="tmd-usual">
From the output result, we can find that
</div>
<p></p>
<ol class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
the iteration index value may be not continuous. The reason is the index is the byte index in the ranged string and one code point may need more than one byte to represent.
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the first character, <code class="tmd-code-span">é</code>, is composed of two runes (3 bytes total)
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the second character, <code class="tmd-code-span">क्षि</code>, is composed of four runes (12 bytes total).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the English character, <code class="tmd-code-span">a</code>, is composed of one rune (1 byte).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the character, <code class="tmd-code-span">π</code>, is composed of one rune (2 bytes).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
the Chinese character, <code class="tmd-code-span">囧</code>, is composed of one rune (3 bytes).
</div>
</li>
</ol>
<p></p>
<div class="tmd-usual">
Then how to iterate bytes in a string? Do this:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i := 0; i &lt; len(s); i++ {
		fmt.Printf("The byte at index %v: 0x%x \n", i, s[i])
	}
}
</code></pre>
<p></p>
<div class="tmd-usual">
Surely, we can also make use of the compiler optimization mentioned above to iterate bytes in a string. For the standard Go compiler, this way is a little more efficient than the above one.
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	// Here, the underlying bytes of s are not copied.
	for i, b := range []byte(s) {
		fmt.Printf("The byte at index %v: 0x%x \n", i, b)
	}
}
</code></pre>
<p></p>
<div class="tmd-usual">
From the above several examples, we know that <code class="tmd-code-span">len(s)</code> will return the number of bytes in string <code class="tmd-code-span">s</code>. The time complexity of <code class="tmd-code-span">len(s)</code> is <span class="tmd-italic"><code class="tmd-code-span">O(1)</code></span>. How to get the number of runes in a string? Using a <code class="tmd-code-span">for-range</code> loop to iterate and count all runes is a way, and using the <a href="https://golang.org/pkg/unicode/utf8/#RuneCountInString">RuneCountInString</a> function in the <code class="tmd-code-span">unicode/utf8</code> standard package is another way. The efficiencies of the two ways are almost the same. The third way is to use <code class="tmd-code-span">len([]rune(s))</code> to get the count of runes in string <code class="tmd-code-span">s</code>. Since Go Toolchain 1.11, the standard Go compiler makes an optimization for the third way to avoid an unnecessary deep copy so that it is as efficient as the former two ways. Please note that the time complexities of these ways are all <span class="tmd-italic"><code class="tmd-code-span">O(n)</code></span>.
</div>
<p></p>
<p></p>
<h3 id="string-concatenation" class="tmd-header-3">
More String Concatenation Methods
</h3>
<p></p>
<div class="tmd-usual">
Please read the "String and Byte Slices" chapter in the <a href="https://go101.org/optimizations/101.html">Go Optimizations 101</a> book.
</div>
<p></p>
<p></p>
<h3 id="use-string-as-byte-slice" class="tmd-header-3">
Sugar: Use Strings as Byte Slices
</h3>
<p></p>
<div class="tmd-usual">
From the article <a href="container.html">arrays, slices and maps</a>, we have learned that we can use the built-in <code class="tmd-code-span">copy</code> and <code class="tmd-code-span">append</code> functions to copy and append slice elements. In fact, as a special case, if the first argument of a call to either of the two functions is a byte slice, then the second argument can be a string (if the call is an <code class="tmd-code-span">append</code> call, then the string argument must be followed by three dots <code class="tmd-code-span">...</code>). In other words, a string can be used as a byte slice for the special case.
</div>
<p></p>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import "fmt"

func main() {
	hello := []byte("Hello ")
	world := "world!"

	// The normal way:
	// helloWorld := append(hello, []byte(world)...)
	helloWorld := append(hello, world...) // sugar way
	fmt.Println(string(helloWorld))

	helloWorld2 := make([]byte, len(hello) + len(world))
	copy(helloWorld2, hello)
	// The normal way:
	// copy(helloWorld2[len(hello):], []byte(world))
	copy(helloWorld2[len(hello):], world) // sugar way
	fmt.Println(string(helloWorld2))
}
</code></pre>
<p></p>
<h3 id="comparison" class="tmd-header-3">
More About String Comparisons
</h3>
<p></p>
<div class="tmd-usual">
Above has mentioned that comparing two strings is comparing their underlying bytes actually. Generally, Go compilers will make the following optimizations for string comparisons.
</div>
<ul class="tmd-list">
<li class="tmd-list-item">
<div class="tmd-usual">
For <code class="tmd-code-span">==</code> and <code class="tmd-code-span">!=</code> comparisons, if the lengths of the compared two strings are not equal, then the two strings must be also not equal (no needs to compare their bytes).
</div>
</li>
<li class="tmd-list-item">
<div class="tmd-usual">
If their underlying byte sequence pointers of the compared two strings are equal, then the comparison result is the same as comparing the lengths of the two strings.
</div>
</li>
</ul>
<p></p>
<div class="tmd-usual">
So for two equal strings, the time complexity of comparing them depends on whether or not their underlying byte sequence pointers are equal. If the two are equal, then the time complexity is <span class="tmd-italic"><code class="tmd-code-span">O(1)</code></span>, otherwise, the time complexity is <span class="tmd-italic"><code class="tmd-code-span">O(n)</code></span>, where <span class="tmd-italic"><code class="tmd-code-span">n</code></span> is the length of the two strings.
</div>
<p></p>
<div class="tmd-usual">
As above mentioned, for the standard Go compiler, in a string value assignment, the destination string value and the source string value will share the same underlying byte sequence in memory. So the cost of comparing the two strings becomes very small.
</div>
<p></p>
<div class="tmd-usual">
Example:
</div>
<pre class="tmd-code line-numbers">
<code class="language-go">package main

import (
	"fmt"
	"time"
)

func main() {
	bs := make([]byte, 1&lt;&lt;26)
	s0 := string(bs)
	s1 := string(bs)
	s2 := s1

	// s0, s1 and s2 are three equal strings.
	// The underlying bytes of s0 is a copy of bs.
	// The underlying bytes of s1 is also a copy of bs.
	// The underlying bytes of s0 and s1 are two
	// different copies of bs.
	// s2 shares the same underlying bytes with s1.

	startTime := time.Now()
	_ = s0 == s1
	duration := time.Now().Sub(startTime)
	fmt.Println("duration for (s0 == s1):", duration)

	startTime = time.Now()
	_ = s1 == s2
	duration = time.Now().Sub(startTime)
	fmt.Println("duration for (s1 == s2):", duration)
}
</code></pre>
<p></p>
<div class="tmd-usual">
Output:
</div>
<pre class="tmd-code output">
duration for (s0 == s1): 10.462075ms
duration for (s1 == s2): 136ns
</pre>
<p></p>
<div class="tmd-usual">
1ms is 1000000ns! So please try to avoid comparing two long strings if they don't share the same underlying byte sequence.
</div>
<p></p>
</div>
